arXiv:1502.00146v1 [math.ST] 31 Jan 2015

Matrix completion by singular value thresholding: sharp bounds

Olga Klopp

CREST and MODAL'X, University Paris Ouest
February 3, 2015 

Abstract 

We consider the matrix completion problem where the aim is to estimate a large
data matrix for which only a relatively small random subset of its entries is
observed. Iterative thresholding methods are a popular approach to the matrix
completion problem. In spite of their empirical success, the theoretical
guarantees of such iterative thresholding methods are poorly understood. The
goal of this paper is to provide strong theoretical guarantees, similar to those
obtained for nuclear-norm penalization methods and one-step thresholding
methods, for an iterative thresholding algorithm which is a modification of the
SoftImpute algorithm. An important consequence of our result is the exact
minimax optimal rates of convergence for the matrix completion problem, which
were known until now only up to a logarithmic factor.

Keywords: matrix completion, low rank matrix estimation, minimax optimality

AMS 2000 subject classification: 62J99, 62H12, 60B20, 15A83 

1 Introduction 

Suppose that we observe a small subset of entries of a large data matrix. The
problem of inferring the many missing entries from this small set of
observations is known as the matrix completion problem. This problem has
attracted considerable attention in the past five years. The first works
[7, 6, 5, 11, 21] introduced the nuclear-norm minimization method. A different
approach, called OptSpace, was proposed in [12, 13]. More recently, a method
based on max-norm minimization was studied in [4, 10]. Other methods include,
for example, GROUSE (Grassmannian Rank-One Update Subspace Estimation) [1]
and orthogonal rank-one matrix pursuit [25].

A popular direction in the matrix completion literature is that of thresholding
methods, which can be divided into two groups: one-step thresholding methods
and iterative thresholding methods. Strong theoretical guarantees have been
obtained for one-step thresholding procedures. For example, Koltchinskii et al.
in [17] introduce a soft-thresholding method and show that it is minimax
optimal up to a logarithmic factor. In [14] Klopp considers a hard thresholding
procedure. Chatterjee [8] proposes a universal singular value thresholding that
can be applied to a large number of matrix estimation problems, including
matrix completion. Despite strong theoretical guarantees, these one-step
thresholding methods have two important drawbacks: they show poor behavior in
practice and only work under the uniform sampling distribution, which is not
realistic in many practical situations.

Much better practical performance has been shown by iterative thresholding
methods. For example, in [3], Cai et al. propose a first-order singular value
thresholding algorithm, SVT, which approximately solves the nuclear norm
minimization problem. In [19], Mazumder et al. introduce the SoftImpute
algorithm. SoftImpute produces a sequence of solutions that converges to a
solution of the nuclear norm regularized least-squares problem when the number
of iterations goes to infinity. These iterative thresholding algorithms are
simple to implement, scale to relatively large matrices and in practice achieve
competitive errors compared to the state-of-the-art algorithms. More recently,
Dhanjal et al. [9] proposed an improvement of the SoftImpute algorithm using
randomized SVDs along with a novel updating method. This improvement bypasses
the bottleneck of the algorithm, namely the singular value decomposition of a
large matrix at each iteration.

The majority of existing algorithms for matrix completion are batch methods,
that is, they operate on the full data matrix. However, in some applications
such as recommendation systems or localization in sensor networks we observe
a sequence of data matrices $M_1, \dots, M_T$ revealed sequentially, where from
$M_t$ to $M_{t+1}$ we add new observations. In such situations the predictive
rule should be refined incrementally. One advantage of iterative thresholding
algorithms is that they can be adapted to such sequential learning, see for
example [9].

In spite of their empirical success, the theoretical guarantees of such
iterative thresholding methods are poorly understood. The goal of this paper is
to provide strong theoretical guarantees, similar to those obtained for
nuclear-norm penalization methods (see, for example, [20, 15]) and one-step
thresholding methods (see [17, 14, 8]), for a modification of the SoftImpute
algorithm.

1.1 Contributions and Related Work 

The contributions of the present paper to the theoretical study of the modified
SoftImpute algorithm are multifaceted. In Section 3.2 we prove an upper bound
on the estimation error of the output $\hat M$ of our algorithm. Let
$M_0 \in \mathbb{R}^{m_1 \times m_2}$ be the unknown matrix of interest.
Suppose, for simplicity, that each entry is observed with the same probability
$p$. Then we prove the following upper bound on the estimation error of
$\hat M$:

$$\frac{\|\hat M - M_0\|_2^2}{m_1 m_2} \lesssim \frac{\operatorname{rank}(M_0)}{p \min(m_1, m_2)}. \tag{1}$$

Here the symbol $\lesssim$ means that the inequality holds up to a
multiplicative numerical constant. To the best of our knowledge, the upper
bound on the estimation error given by (1) is strictly better than all upper
bounds available in the matrix completion literature.

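For instance, for the same setting, Chatterjee in [8] obtains the following
larger bound (see Remark 2):

$$\frac{\|\hat M - M_0\|_2^2}{m_1 m_2} \lesssim \sqrt{\frac{\operatorname{rank}(M_0)}{p \min(m_1, m_2)}}.$$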



On the other hand, [17, 20, 15], among other papers, consider a slightly
different setting where the matrix completion problem is viewed as a particular
case of the trace regression model. In this setting the number of observations
$n$ is fixed. The drawback here is that in this model each entry can be
observed multiple times, which is not the case in a large number of practical
situations. We consider a different setting where each entry can be observed at
most once (see Section 2.1). However, it is easy to see that these two settings
are closely related if we put $n = p m_1 m_2$. Compared to (1), the bounds
obtained in [17, 20, 15] have an additional $\log(m_1 + m_2)$ factor.

Koltchinskii et al. in [17] obtained lower bounds for the estimation error
without this additional $\log(m_1 + m_2)$ factor. So our result answers an
important theoretical question: what is the exact minimax rate of convergence
for the matrix completion problem? As the lower bound in [17] is obtained for a
different setting, in Section 4 we adapt their proof to our setting, showing
that the minimax rate of convergence for the matrix completion problem is given
by (1) and that the estimator produced by our algorithm is minimax optimal.
Note that our techniques can be adapted to the setting considered in
[17, 20, 15] and lead to an upper bound without the additional
$\log(m_1 + m_2)$ factor in this setting also.

Another important point is that a large part of the matrix completion
literature considers the uniform sampling at random setting, where each entry
is observed with the same probability $p$. In many applications, such as
recommendation systems, this assumption is not realistic. The theoretical
analysis in the present paper is carried out for quite general sampling
distributions and shows that our iterative thresholding algorithm performs well
in such situations. Finally, our results give theoretical insights into the
choice of the parameters in the modified SoftImpute algorithm.

1.2 Organisation of the paper 

The remainder of this paper is organized as follows. In Section 2.1 we introduce
our model and the assumptions on the sampling scheme. For the reader's
convenience, we collect the notation used throughout the paper in Section 2.2.
In Section 3.1 we present a modification of the SoftImpute algorithm for matrix
completion. The upper bounds on the estimation error are derived in Section
3.2. Finally, the lower bounds are obtained in Section 4, and the Appendix
contains the proofs.

2 Preliminaries 

2.1 Model and sampling scheme 

Suppose that we observe a relatively small number of entries of a data matrix

$$X = M_0 + E. \tag{2}$$

Here $M_0 = (m_{ij}) \in \mathbb{R}^{m_1 \times m_2}$ is the unknown matrix of
interest and $E = (\xi_{ij}) \in \mathbb{R}^{m_1 \times m_2}$ is the matrix
containing the noise. We assume that the noise variables $\xi_{ij}$ are
independent, zero mean and bounded:

Assumption 1. $\mathbb{E}(\xi_{ij}) = 0$, $\mathbb{E}(\xi_{ij}^2) = \sigma^2$
and there exists a positive constant $b > 0$ such that

$$\max_{i,j} |\xi_{ij}| \le b.$$

We suppose that each entry of $X$ is observed independently of the other
entries. For the entry $(i,j) \in [m_1] \times [m_2]$, we denote the
probability of being observed by $\pi_{ij}$. Let $\eta_{ij}$ be independent
Bernoulli variables with parameters $\pi_{ij}$ and
$Y_{ij} = \eta_{ij}(m_{ij} + \xi_{ij})$. Then, $Y = (Y_{ij})$ is the matrix
containing our observations. We denote by $\Omega$ the random set of observed
indices.
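The observation model above can be simulated in a few lines. The sketch below is only an illustration: the function name `sample_observations`, the uniform noise law on $[-b, b]$ (one bounded, zero-mean choice compatible with Assumption 1) and the rank-one test matrix are assumptions made here, not part of the paper.

```python
import numpy as np


def sample_observations(M0, pi, b, rng):
    """Draw Y_ij = eta_ij * (m_ij + xi_ij) with eta_ij ~ Bernoulli(pi_ij)."""
    # Assumption: noise uniform on [-b, b] (zero mean, bounded by b, variance b**2 / 3).
    xi = rng.uniform(-b, b, size=M0.shape)
    # Independent Bernoulli(pi_ij) indicators; True marks an observed entry (the set Omega).
    eta = rng.random(M0.shape) < pi
    # Unobserved entries of Y are stored as 0, as in the definition of Y_ij.
    Y = np.where(eta, M0 + xi, 0.0)
    return Y, eta


rng = np.random.default_rng(0)
M0 = np.outer(rng.uniform(-1, 1, 30), rng.uniform(-1, 1, 20))  # illustrative rank-one target
pi = np.full(M0.shape, 0.3)                                     # uniform sampling, p = 0.3
Y, mask = sample_observations(M0, pi, b=0.1, rng=rng)
```

Passing a non-constant array `pi` gives the general (non-uniform) sampling scheme considered in the paper.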

In the simplest situation each coefficient is observed with the same
probability, i.e. for every $(i,j) \in [m_1] \times [m_2]$, $\pi_{ij} = p$.
Unfortunately, such an assumption on the sampling distribution is not realistic
in many practical applications. In the present paper, we consider a general
sampling model. We suppose that each coefficient is observed with a positive
probability:

Assumption 2. There exists $p > 0$ such that for any
$(i,j) \in \{1, \dots, m_1\} \times \{1, \dots, m_2\}$

$$\pi_{ij} \ge p.$$

For any $A = (A_{ij}) \in \mathbb{R}^{m_1 \times m_2}$ define the Frobenius
norm of $A$ weighted by $\pi_{ij}$:

$$\|A\|^2_{L_2(\Pi)} = \sum_{(i,j)} \pi_{ij} A_{ij}^2.$$

Assumption 2 implies that

$$\|A\|^2_{L_2(\Pi)} \ge p \|A\|_2^2. \tag{3}$$
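Inequality (3) follows term by term from $\pi_{ij} \ge p$. The following sketch checks it numerically on random inputs; the helper name `weighted_frobenius_sq` and the test data are illustrative assumptions.

```python
import numpy as np


def weighted_frobenius_sq(A, pi):
    """Squared L_2(Pi) norm: sum over (i, j) of pi_ij * A_ij**2."""
    return float(np.sum(pi * A ** 2))


rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))
pi = rng.uniform(0.2, 0.9, size=A.shape)   # sampling probabilities
p = float(pi.min())                        # the largest p allowed by Assumption 2
lhs = weighted_frobenius_sq(A, pi)         # ||A||^2_{L_2(Pi)}
rhs = p * float(np.sum(A ** 2))            # p * ||A||_2^2
```

With uniform sampling (`pi` constant) the two sides coincide, which is why (3) is tight in that case.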

We denote the column and row marginals by

$$\pi_{\cdot j} = \sum_{i=1}^{m_1} \pi_{ij} \quad \text{and} \quad \pi_{i \cdot} = \sum_{j=1}^{m_2} \pi_{ij}.$$

Suppose that we know an upper bound $L$ on their maximum:

$$\max_{i,j} (\pi_{\cdot j}, \pi_{i \cdot}) \le L. \tag{4}$$

Note that we can easily get an estimate of this upper bound using the
empirical frequencies


/ v-i —1 
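The exact form of the empirical estimate is not legible in this copy. One plausible reading, offered here purely as an assumption, replaces each $\pi_{ij}$ by the indicator $\eta_{ij}$, so that the row and column marginals are estimated by the number of observed entries in each row and column. The function name `empirical_marginal_bound` is hypothetical.

```python
import numpy as np


def empirical_marginal_bound(mask):
    """Largest number of observed entries in any single row or column.

    Assumption: sum_i eta_ij estimates pi_{.j} and sum_j eta_ij estimates pi_{i.}.
    """
    col_counts = mask.sum(axis=0)   # one count per column j
    row_counts = mask.sum(axis=1)   # one count per row i
    return int(max(col_counts.max(), row_counts.max()))


mask = np.array([[1, 0, 1],
                 [0, 0, 1],
                 [1, 1, 1]], dtype=bool)
L_hat = empirical_marginal_bound(mask)
```

In this toy mask the third row and the third column are fully observed, so the estimate equals 3.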


2.2 Notation 

We provide a brief summary of the notation used throughout this paper. Let
$A, B$ be matrices in $\mathbb{R}^{m_1 \times m_2}$.

• For a matrix $A$, $A_{ij}$ is its $(i,j)$th entry.

• We denote by $S_\lambda(W) \equiv U D_\lambda V^T$ the soft-thresholding
operator, where $D_\lambda = \mathrm{diag}[(d_1 - \lambda)_+, \dots, (d_r - \lambda)_+]$,
$U D V^T$ is the SVD of $W$, $D = \mathrm{diag}[d_1, \dots, d_r]$ and
$t_+ = \max(t, 0)$.

• For any set $I$, $|I|$ denotes its cardinality and $\bar I$ its complement.
Let $a \vee b = \max(a, b)$ and $a \wedge b = \min(a, b)$.

• For two matrices $A, B \in \mathbb{R}^{m_1 \times m_2}$ define the scalar
product

$$\langle A, B \rangle = \mathrm{tr}(A^T B).$$

• We denote by $\|A\|_2$ the usual $\ell_2$-norm. Additionally, we use the
following matrix norms: $\|A\|_*$ is the nuclear norm (the sum of singular
values), $\|A\|$ is the operator norm (the largest singular value),
$\|A\|_\infty$ is the largest absolute value of the entries:

$$\|A\|_\infty = \max_{i,j} |A_{ij}|.$$


• $\pi_{ij}$ is the probability to observe the $(i,j)$-th element. For
$j = 1, \dots, m_2$, $\pi_{\cdot j} = \sum_{i=1}^{m_1} \pi_{ij}$ and for
$i = 1, \dots, m_1$, $\pi_{i \cdot} = \sum_{j=1}^{m_2} \pi_{ij}$. We have that
$\max_{i,j}(\pi_{\cdot j}, \pi_{i \cdot}) \le L$.

• Let $M = \max(m_1, m_2)$, $m = \min(m_1, m_2)$ and $d = m_1 + m_2$.

• Let $I \subset \{1, \dots, m_1\} \times \{1, \dots, m_2\}$ be a subset of
indices. Given a matrix $A = (A_{ij})$, we define its restriction on $I$,
$A_I$, in the following way: $(A_I)_{ij} = A_{ij}$ if $(i,j) \in I$ and
$(A_I)_{ij} = 0$ if not.

• Let $\{\epsilon_{ij}\}$ be an i.i.d. Rademacher sequence and
$X_{ij} = e_i(m_1) e_j^T(m_2)$, where $e_k(l)$ are the canonical basis vectors
in $\mathbb{R}^l$. We define

$$\Sigma_R = \sum_{(i,j)} \epsilon_{ij} \eta_{ij} X_{ij} \quad \text{and} \quad \Sigma = \sum_{(i,j)} \eta_{ij} \xi_{ij} X_{ij}.$$
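The soft-thresholding operator $S_\lambda$ from the notation list can be written directly from its definition. This is a minimal sketch using a dense SVD; the function name `soft_threshold` is chosen here for illustration.

```python
import numpy as np


def soft_threshold(W, lam):
    """S_lambda(W) = U diag((d_i - lam)_+) V^T, where W = U diag(d) V^T is the SVD."""
    U, d, Vt = np.linalg.svd(W, full_matrices=False)
    d_shrunk = np.maximum(d - lam, 0.0)   # (d_i - lam)_+
    return (U * d_shrunk) @ Vt            # scales the columns of U, then recombines


W = np.diag([5.0, 2.0, 0.5])
S = soft_threshold(W, 1.0)   # singular values 5, 2, 0.5 are shrunk to 4, 1, 0
```

Singular values below the threshold are set to zero, so the output typically has lower rank than the input.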


3 The Singular Value Thresholding Algorithm 


In this section we introduce an iterative singular value thresholding algorithm 
and discuss its theoretical properties. We show that it enjoys strong theoretical 
guarantees and, unlike one-step thresholding procedures, is well adapted for 
general non-uniform sampling distributions. 

3.1 Algorithm 

Our algorithm is based on the SoftImpute algorithm proposed by Mazumder et
al. in [19]. The SoftImpute algorithm is inspired by SVD-Impute of Troyanskaya
et al. [23]. It alternates between imputing the missing values from a current
SVD and updating the SVD using the data matrix.


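Algorithm 1

Require: Matrix $Y$, regularization parameter $\lambda$ and $a$, an upper bound
on the sup-norm of $M_0$.

1. $M^{old} = 0$

2. (a) Repeat

   (i) Compute $M^{new} \leftarrow S_\lambda\big(Y + (M^{old})_{\bar\Omega}\big)$.

   (ii) If $\|(M^{old} - M^{new})_{\bar\Omega}\| < \lambda/3$ and
   $\|M^{old} - M^{new}\|_\infty < a$, exit.

   (iii) Put

   $$M^{old}_{ij} = \begin{cases} M^{new}_{ij} & \text{if } |M^{new}_{ij}| \le a \\ a & \text{if } M^{new}_{ij} > a \\ -a & \text{if } M^{new}_{ij} < -a. \end{cases} \tag{6}$$

   (b) Assign $\hat M \leftarrow M^{new}$.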
3. Output $\hat M$.

This algorithm repeatedly replaces the missing entries with the current guess,
updates the guess by solving

$$M^{new} \in \operatorname*{arg\,min}_{M} \; f(M) = \frac{1}{2} \|Y + (M^{old})_{\bar\Omega} - M\|_2^2 + \lambda \|M\|_* \tag{7}$$

and truncates $M^{new}$. Let us denote by $(M^k)_{k \ge 0}$ the sequence of
solutions produced by Algorithm 1. We have the following result:

Lemma 1. For the successive differences of the sequence $(M^k)_{k \ge 0}$ we
have that

$$\|M^{k+1} - M^k\|_2 \to 0 \quad \text{as } k \to \infty, \tag{8}$$

which implies

$$\|(M^{k+1} - M^k)_{\bar\Omega}\| \to 0 \quad \text{and} \quad \|M^{k+1} - M^k\|_\infty \to 0 \quad \text{as } k \to \infty. \tag{9}$$
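Algorithm 1 can be sketched in code as follows. This is an illustrative reading, not the author's implementation: the stopping rule (operator norm of the update off the observed set below $\lambda/3$, sup-norm of the update below $a$) is inferred from Lemma 1 and from the proof of Theorem 2, and the names `modified_softimpute`, `mask` and the safeguard `max_iter` are assumptions introduced here.

```python
import numpy as np


def soft_threshold(W, lam):
    """S_lambda(W): shrink every singular value of W by lam, flooring at 0."""
    U, d, Vt = np.linalg.svd(W, full_matrices=False)
    return (U * np.maximum(d - lam, 0.0)) @ Vt


def modified_softimpute(Y, mask, lam, a, max_iter=500):
    """Sketch of Algorithm 1; `mask` is True on observed entries (the set Omega).

    `max_iter` is a safeguard added for this sketch, not part of the algorithm.
    """
    M_old = np.zeros_like(Y, dtype=float)
    M_new = M_old
    for _ in range(max_iter):
        # (i) keep observed data, impute the rest with the current guess, threshold
        M_new = soft_threshold(np.where(mask, Y, M_old), lam)
        diff = M_old - M_new
        # (ii) operator norm off the observed set and sup-norm of the full update
        if (np.linalg.norm(np.where(mask, 0.0, diff), 2) < lam / 3
                and np.abs(diff).max() < a):
            break
        # (iii) truncate the entries to [-a, a], as in (6)
        M_old = np.clip(M_new, -a, a)
    return M_new


rng = np.random.default_rng(0)
M0 = np.outer(rng.uniform(-1, 1, 40), rng.uniform(-1, 1, 30))   # rank one, entries in [-1, 1]
mask = rng.random(M0.shape) < 0.5
Y = np.where(mask, M0 + rng.uniform(-0.1, 0.1, M0.shape), 0.0)
M_hat = modified_softimpute(Y, mask, lam=1.0, a=1.0)
```

The dense SVD at every iteration is the computational bottleneck mentioned in the introduction; a randomized or incremental SVD could be substituted without changing the structure of the loop.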

3.2 Upper bound on the estimation error 

In this section we derive an upper bound on the estimation error of $\hat M$
produced by Algorithm 1. This bound is non-asymptotic and implies, in
particular, that the proposed estimator is minimax optimal. We start with a
general result which is proven in Appendix A.

Theorem 2. Let Assumptions 1 and 2 be satisfied and $\|M_0\|_\infty \le a$.
Assume that $\lambda \ge 3 \|\Sigma\|$. Then, with probability at least
$1 - 8/d$,

$$\|\hat M - M_0\|^2_{L_2(\Pi)} \le C p^{-1} \left\{ \operatorname{rank}(M_0) \left( \lambda^2 + a^2 (\mathbb{E}\|\Sigma_R\|)^2 \right) + a^2 + \log(d) \right\},$$

where $d = m_1 + m_2$.

Using Assumption 2, Theorem 2 implies the following bound on the estimation
error measured in normalized Frobenius norm.

Corollary 3. Under the assumptions of Theorem 2 and with probability at least
$1 - 8/d$,

$$\frac{\|\hat M - M_0\|_2^2}{m_1 m_2} \le \frac{C}{p^2 m_1 m_2} \left\{ \operatorname{rank}(M_0) \left( \lambda^2 + a^2 (\mathbb{E}\|\Sigma_R\|)^2 \right) + a^2 + \log(d) \right\}.$$

In order to get a bound in closed form we need suitable upper bounds on
$\mathbb{E}(\|\Sigma_R\|)$ and, with probability close to 1, on $\|\Sigma\|$.

Lemma 4. Suppose that $\xi_{ij}$ are independent and satisfy Assumption 1.
Then, there exist absolute constants $c^*, C^* > 0$ such that, for all $t > 0$,
with probability at least $1 - m e^{-t^2}$ we have

$$\|\Sigma\| \le 8 \sigma \sqrt{L} + c^* b \, t, \tag{10}$$

where $L \ge 1$ is defined in (4).

Moreover, we have

$$\mathbb{E}\|\Sigma_R\| \le C^* \left( \sqrt{L} + \sqrt{\log m} \right). \tag{11}$$

This lemma is proven in Appendix F.
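Bound (11) can be probed numerically in the uniform sampling case, where $\pi_{ij} = p$ and $L = p \max(m_1, m_2)$. The sketch below is a Monte Carlo illustration only; the dimensions, the number of repetitions and the function name `rademacher_sampling_norm` are arbitrary choices made here, and the resulting ratio is not an estimate of the constant $C^*$ of the lemma.

```python
import numpy as np


def rademacher_sampling_norm(m1, m2, p, rng):
    """Operator norm of Sigma_R = sum eps_ij * eta_ij * X_ij under uniform sampling."""
    eta = rng.random((m1, m2)) < p                   # Bernoulli(p) sampling indicators
    eps = rng.choice([-1.0, 1.0], size=(m1, m2))     # i.i.d. Rademacher signs
    return np.linalg.norm(eta * eps, 2)              # largest singular value


rng = np.random.default_rng(4)
m1, m2, p = 200, 150, 0.2
L = p * max(m1, m2)                                  # largest row or column marginal
norms = [rademacher_sampling_norm(m1, m2, p, rng) for _ in range(20)]
ratio = float(np.mean(norms) / (np.sqrt(L) + np.sqrt(np.log(min(m1, m2)))))
```

A ratio that stays bounded as the dimensions grow is what (11) predicts.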

Taking $t = \sqrt{2 \log(d)}$ in Lemma 4, we get that with probability at least
$1 - 1/d$

$$\|\Sigma\| \le 8 \sigma \sqrt{L} + c^* b \sqrt{2 \log(d)};$$

then, we can choose

$$\lambda = 3 \left( 8 \sigma \sqrt{L} + c^* b \sqrt{2 \log(d)} \right). \tag{12}$$
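The choice (12) is straightforward to evaluate once $\sigma$, $b$ and $L$ are known or estimated. In the sketch below the absolute constant $c^*$ is not specified by the paper, so the default `c_star=1.0` is a placeholder assumption, as is the function name.

```python
import math


def regularization_parameter(sigma, b, L, m1, m2, c_star=1.0):
    """lambda = 3 * (8 * sigma * sqrt(L) + c_star * b * sqrt(2 * log(d))), d = m1 + m2.

    c_star stands in for the unspecified absolute constant c^* of Lemma 4.
    """
    d = m1 + m2
    return 3.0 * (8.0 * sigma * math.sqrt(L) + c_star * b * math.sqrt(2.0 * math.log(d)))


lam = regularization_parameter(sigma=1.0, b=3.0, L=50.0, m1=500, m2=500)
```

The first term scales with the noise level and the sampling marginals, the second with the noise bound and only logarithmically with the dimension.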

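With this choice of $\lambda$ we obtain the following theorem.

Theorem 5. Let Assumptions 1 and 2 be satisfied and $\|M_0\|_\infty \le a$.
Then, with probability at least $1 - 8/d$,

$$\|\hat M - M_0\|^2_{L_2(\Pi)} \le C p^{-1} \operatorname{rank}(M_0) \left\{ (a \vee \sigma)^2 L + a^2 \log(m) + b^2 \log(d) \right\}$$

and

$$\frac{\|\hat M - M_0\|_2^2}{m_1 m_2} \le \frac{C \operatorname{rank}(M_0)}{p^2 m_1 m_2} \left\{ (a \vee \sigma)^2 L + a^2 \log(m) + b^2 \log(d) \right\}.$$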


Remark 1. Note that $\pi_{ij} \ge p$ yields $L \ge M p$. Then, the upper bound
on the estimation error in Theorem 5 is at least a constant times

$$\frac{\operatorname{rank}(M_0)}{p m}.$$

So, in order to get a small estimation error, $p$ should be larger than
$\operatorname{rank}(M_0)/m$. We denote by $n = \sum_{i,j} \pi_{ij}$ the
expected number of observations. The condition

$$p \ge C \frac{\operatorname{rank}(M_0)}{m}$$

implies the following condition on $n$:

$$n \ge C \operatorname{rank}(M_0) M. \tag{13}$$

When the rank of the matrix $M_0$ is small, this necessary number of
observations is close to the number of degrees of freedom of the matrix $M_0$,
which is

$$(m_1 + m_2) \operatorname{rank}(M_0) - (\operatorname{rank}(M_0))^2.$$
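To see how close condition (13) is to the number of degrees of freedom, the two quantities can be compared for concrete dimensions. The sizes below are arbitrary illustrative values and the function names are introduced here; the constant $C$ in (13) is ignored.

```python
def degrees_of_freedom(m1, m2, r):
    """Number of free parameters of an m1 x m2 matrix of rank r."""
    return (m1 + m2) * r - r * r


def rank_times_max_dim(m1, m2, r):
    """The quantity rank(M_0) * M from (13), with M = max(m1, m2)."""
    return r * max(m1, m2)


dof = degrees_of_freedom(1000, 800, 5)      # 1800 * 5 - 25 = 8975
needed = rank_times_max_dim(1000, 800, 5)   # 5 * 1000 = 5000
```

For any $r \le \min(m_1, m_2)$ one has $rM \le (m_1 + m_2) r - r^2 \le 2rM$, so the two quantities always agree up to a factor of 2.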

Let us restrict our attention to the non-degenerate case $M_0 \ne 0$ (we can
easily include the degenerate case by replacing $\operatorname{rank}(M_0)$ by
$\operatorname{rank}(M_0) \vee 1$). Assuming that the expected number of
observations $n$ is not too small, we can get a simpler bound on the estimation
error. Suppose that $n \ge c^* m \log(d)$. Then, using

$$L m \ge n \ge c^* m \log d$$

we get $L \ge c^* \log d$ and we can choose $\lambda$ in the following way:

$$\lambda = 18 b \sqrt{L}. \tag{14}$$

With this choice of $\lambda$ we get the following bound on the estimation
error.

Corollary 6. Let Assumptions 1 and 2 be satisfied and $\|M_0\|_\infty \le a$.
Assume that $n \ge c^* m \log(d)$ and $M_0 \ne 0$. Then, with probability at
least $1 - 8/d$,

$$\frac{\|\hat M - M_0\|_2^2}{m_1 m_2} \le \frac{C \operatorname{rank}(M_0) (a \vee b)^2 L}{p^2 m_1 m_2}.$$

In order to compare this result with previous results on noisy matrix
completion we consider a more restrictive assumption on the sampling
distribution. That is, we assume that this distribution is close to the uniform
one:

Assumption 3. There exist positive constants $\mu_1$ and $\mu_2$ independent of
$m_1$ and $m_2$ and a $0 < p \le 1$ such that for every
$(i,j) \in \{1, \dots, m_1\} \times \{1, \dots, m_2\}$ we have

$$\mu_2 p \le \pi_{ij} \le \mu_1 p.$$

Under this assumption Theorem 2 yields

Corollary 7. Let Assumptions 1 and 3 be satisfied and $\|M_0\|_\infty \le a$.
Assume that $n \ge c^* m \log(d)$ and $\lambda$ is given by (14). Then, with
probability at least $1 - 8/d$,

$$\frac{\|\hat M - M_0\|_2^2}{m_1 m_2} \le \frac{C \operatorname{rank}(M_0) (a \vee b)^2}{p m}.$$

Remark 2. Let us compare the bound given by Corollary 7 with bounds available
in the literature. Our model was previously considered by Chatterjee in [8] in
the case of the uniform sampling distribution, that is, $\pi_{ij} = p$ for any
$(i,j) \in \{1, \dots, m_1\} \times \{1, \dots, m_2\}$. In [8], Chatterjee
introduces a simple estimation procedure, called Universal Singular Value
Thresholding, which is applied to a number of questions in low rank matrix
estimation, blockmodels, distance matrix completion, latent space models, etc.
For the matrix completion problem and under the additional assumption
$p \ge m^{-1+\varepsilon}$ for some $\varepsilon > 0$, the bound obtained in [8]
is the following one:

$$\frac{\|\hat M - M_0\|_2^2}{m_1 m_2} \le C \sqrt{\frac{\operatorname{rank}(M_0)}{p m}} \, (a \vee b)^2.$$

The rate of convergence given by Corollary 7 is faster and, as we will see in
Section 4, is minimax optimal. Note that the additional assumption
$p \ge m^{-1+\varepsilon}$ yields the following condition on the expected
number of observations:

$$n \ge m^{\varepsilon} M. \tag{15}$$

For low rank matrices, this necessary number of observations is larger than the
number of observations required by our method and given by (13).

In [20, 17, 15] a closely related setup for the matrix completion problem using
the trace regression model was considered. The main difference between these
two settings is that in the case of trace regression the number of observations
is not random and each entry may be observed multiple times. In our setting
the number of observations is random and each entry is observed at most once.
Comparing with Corollary 7 and using $n = p m_1 m_2$, we see that the bounds
obtained in [20, 17, 15] contain an additional logarithmic factor
$\log(m_1 + m_2)$.

4 Minimax Lower bounds 

In this section, we prove the minimax lower bound showing that the rates
attained by our estimator are optimal. The minimax lower bound in a closely
related problem was obtained by Koltchinskii et al. in [17]. We adapt their
proof to our setup.

We will denote by $\inf_{\hat M}$ the infimum over all estimators. For any
$M_0 \in \mathbb{R}^{m_1 \times m_2}$, let $\mathbb{P}_{M_0}$ denote the
probability distribution of the observations $(Y_{ij})$ satisfying (2).

For any integer $0 \le r \le \min(m_1, m_2)$ and any $a > 0$, we consider the
class of matrices

$$\mathcal{A}(r, a) = \left\{ M \in \mathbb{R}^{m_1 \times m_2} : \operatorname{rank}(M) \le r, \ \|M\|_\infty \le a \right\}. \tag{16}$$

We will prove the lower bound in the case of the uniform sampling distribution,
that is, we suppose that each entry is observed with the same probability $p$.
As noted in Remark 1, in order to get a small estimation error we need to
observe a sufficiently large number of entries, or, equivalently, the
probability $p$ should be larger than $r/m$. We prove a lower bound on the
estimation risk when this condition is satisfied.

Theorem 8. Suppose that $m_1, m_2 \ge 2$ and $p \ge r/m$. Fix $a > 0$ and an
integer $1 \le r \le \min(m_1, m_2)$. Suppose that the variables $\xi_{ij}$ are
i.i.d. Gaussian $\mathcal{N}(0, \sigma^2)$, $\sigma^2 > 0$. Then, there exist
absolute constants $\beta \in (0, 1)$ and $c > 0$ such that

$$\inf_{\hat M} \sup_{M_0 \in \mathcal{A}(r, a)} \mathbb{P}_{M_0} \left( \frac{\|\hat M - M_0\|_2^2}{m_1 m_2} > c \, (\sigma \wedge a)^2 \frac{r}{p m} \right) \ge \beta.$$


A Proof of Theorem 2 


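1. By Lemma 1 in [19], $\hat M$ minimizes

$$f(M) = \frac{1}{2} \|Y + (M^{old})_{\bar\Omega} - M\|_2^2 + \lambda \|M\|_*.$$

Then, using the sub-gradient stationary conditions we have

$$\hat M - Y - (M^{old})_{\bar\Omega} + \lambda \hat V = 0,$$

where $\hat V \in \partial \|\hat M\|_*$. A simple calculation yields

$$\|(\hat M - M_0)_\Omega\|_2^2 \le \underbrace{\left| \langle (Y - M_0)_\Omega, \hat M - M_0 \rangle \right|}_{\mathrm{I}} + \underbrace{\left| \langle (M^{old} - \hat M)_{\bar\Omega}, \hat M - M_0 \rangle \right|}_{\mathrm{II}} + \underbrace{\lambda \langle \hat V, M_0 - \hat M \rangle}_{\mathrm{III}}. \tag{17}$$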

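2. We estimate each term in (17) separately. For the first term, we have that
$(Y - M_0)_\Omega = \Sigma$ where $\Sigma = \sum_{(i,j)} \eta_{ij} \xi_{ij} X_{ij}$.
Then, by the duality between the nuclear and the operator norms, we obtain

$$\left| \langle (Y - M_0)_\Omega, \hat M - M_0 \rangle \right| \le \|\Sigma\| \, \|\hat M - M_0\|_*. \tag{18}$$

For the second term, using again the duality between the nuclear and the
operator norms and the stopping criterion of Algorithm 1, we obtain

$$\left| \langle (M^{old} - \hat M)_{\bar\Omega}, \hat M - M_0 \rangle \right| \le \|(M^{old} - \hat M)_{\bar\Omega}\| \, \|\hat M - M_0\|_* \le \frac{\lambda}{3} \|\hat M - M_0\|_*. \tag{19}$$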

3. In order to estimate the third term, we use that by monotonicity of
subdifferentials of convex functions we have
$\langle \hat V - V, \hat M - M_0 \rangle \ge 0$ for any
$V \in \partial \|M_0\|_*$. This implies

$$\langle \hat V, M_0 - \hat M \rangle \le \langle V, M_0 - \hat M \rangle. \tag{20}$$

Let $P_S$ be the projector on the linear vector subspace $S$ and let $S^\perp$
be the orthogonal complement of $S$. Let $u_j(A)$ and $v_j(A)$ denote
respectively the left and right orthonormal singular vectors of a matrix $A$.
$S_1(A)$ is the linear span of $\{u_j(A)\}$, $S_2(A)$ is the linear span of
$\{v_j(A)\}$. We set

$$P_A^\perp(B) = P_{S_1^\perp(A)} B P_{S_2^\perp(A)} \quad \text{and} \quad P_A(B) = B - P_A^\perp(B). \tag{21}$$

Since $P_A(B) = P_{S_1^\perp(A)} B P_{S_2(A)} + P_{S_1(A)} B$ and
$\operatorname{rank}(P_{S_i(A)} B) \le \operatorname{rank}(A)$ we have that

$$\operatorname{rank}(P_A(B)) \le 2 \operatorname{rank}(A). \tag{22}$$
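The decomposition (21) and the rank bound (22) can be checked on random matrices. In the sketch below the function name `projections`, the tolerance used to determine the rank of $A$ and the test dimensions are illustrative assumptions.

```python
import numpy as np


def projections(A, B, tol=1e-10):
    """Return (P_A(B), P_A^perp(B)) as defined in (21)."""
    U, s, Vt = np.linalg.svd(A)               # full SVD of A
    r = int(np.sum(s > tol))                  # numerical rank of A
    U1, V1 = U[:, :r], Vt[:r, :].T            # bases of S_1(A) and S_2(A)
    P1 = U1 @ U1.T                            # projector on S_1(A)
    P2 = V1 @ V1.T                            # projector on S_2(A)
    P_perp = (np.eye(A.shape[0]) - P1) @ B @ (np.eye(A.shape[1]) - P2)
    return B - P_perp, P_perp


rng = np.random.default_rng(2)
A = rng.standard_normal((8, 2)) @ rng.standard_normal((2, 6))   # rank 2
B = rng.standard_normal((8, 6))                                  # generic, full rank
P_A_B, P_perp_B = projections(A, B)
rank_A = int(np.linalg.matrix_rank(A))
rank_P = int(np.linalg.matrix_rank(P_A_B))
```

Even though `B` has full rank, the component `P_A_B` has rank at most twice the rank of `A`.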

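Note that the subdifferential of the convex function $A \mapsto \|A\|_*$ is the
following set of matrices (cf. [26]):

$$\partial \|A\|_* = \left\{ \sum_{j=1}^{\operatorname{rank}(A)} u_j(A) v_j^T(A) + P_A^\perp(W) : \ \|W\| \le 1 \right\}. \tag{23}$$

Inequality (20) and (23) imply

$$\mathrm{III} \le \lambda \left( \Big\langle \sum_{j=1}^{\operatorname{rank}(M_0)} u_j(M_0) v_j^T(M_0), M_0 - \hat M \Big\rangle + \langle P_{M_0}^\perp(W), M_0 - \hat M \rangle \right). \tag{24}$$

Using the fact that
$\big\| \sum_{j} u_j(M_0) v_j^T(M_0) \big\| = 1$ and

$$\Big\langle \sum_{j} u_j(M_0) v_j^T(M_0), M_0 - \hat M \Big\rangle = \Big\langle \sum_{j} u_j(M_0) v_j^T(M_0), P_{M_0}(M_0 - \hat M) \Big\rangle$$

we obtain

$$\mathrm{III} \le \lambda \left( \|P_{M_0}(M_0 - \hat M)\|_* + \langle P_{M_0}^\perp(W), M_0 - \hat M \rangle \right). \tag{25}$$

Now, by the duality between the nuclear and the operator norms, there exists
$W$ with $\|W\| \le 1$ and such that

$$\langle P_{M_0}^\perp(W), M_0 - \hat M \rangle = - \langle W, P_{M_0}^\perp(\hat M) \rangle = - \|P_{M_0}^\perp(\hat M)\|_*. \tag{26}$$

For this particular choice of $W$, (25) and (26) imply

$$\mathrm{III} \le \lambda \left( \|P_{M_0}(M_0 - \hat M)\|_* - \|P_{M_0}^\perp(\hat M)\|_* \right). \tag{27}$$

Putting (18), (19), and (27) into (17) and using $\lambda \ge 3 \|\Sigma\|$ we
obtain

$$\|(M_0 - \hat M)_\Omega\|_2^2 \le \frac{2\lambda}{3} \|\hat M - M_0\|_* + \lambda \left( \|P_{M_0}(M_0 - \hat M)\|_* - \|P_{M_0}^\perp(\hat M)\|_* \right). \tag{28}$$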

4. The triangle inequality and (22) lead to

$$\|(M_0 - \hat M)_\Omega\|_2^2 \le \frac{5\lambda}{3} \|P_{M_0}(M_0 - \hat M)\|_* \le \frac{5\lambda}{3} \sqrt{2 \operatorname{rank}(M_0)} \, \|M_0 - \hat M\|_2 \tag{29}$$

and

$$\|P_{M_0}^\perp(\hat M)\|_* \le 5 \|P_{M_0}(M_0 - \hat M)\|_*. \tag{30}$$

Inequality (30) implies

$$\|\hat M - M_0\|_* \le 6 \|P_{M_0}(\hat M - M_0)\|_* \le \sqrt{72 \operatorname{rank}(M_0)} \, \|\hat M - M_0\|_2. \tag{31}$$

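5. For $0 < r \le m$ we consider the following constraint set

$$\mathcal{C}(r) = \left\{ A \in \mathbb{R}^{m_1 \times m_2} : \ \|A\|_\infty = 1, \ \|A\|_* \le \sqrt{r} \|A\|_2, \ \|A\|^2_{L_2(\Pi)} \ge \frac{\log(d)}{0.0006 \log(6/5) \, p} \right\}. \tag{32}$$

Note that the condition $\|A\|_* \le \sqrt{r} \|A\|_2$ is satisfied if
$\operatorname{rank}(A) \le r$.

We have the following result for matrices in $\mathcal{C}(r)$. Its proof is
given in Appendix C.

Lemma 9. For all $A \in \mathcal{C}(r)$

$$\|A_\Omega\|_2^2 \ge \frac{1}{2} \|A\|^2_{L_2(\Pi)} - 44 p^{-1} \left\{ r (\mathbb{E}\|\Sigma_R\|)^2 + 18 \right\}$$

with probability at least $1 - 8/d$.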


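Note that the stopping condition $\|\hat M - M^{old}\|_\infty < a$, together
with $\|M^{old}\|_\infty \le a$ and $\|M_0\|_\infty \le a$, implies
$\|\hat M - M_0\|_\infty \le 3a$.

We now consider two cases, depending on whether the matrix
$\frac{1}{\|\hat M - M_0\|_\infty} (\hat M - M_0)$ belongs to the set
$\mathcal{C}(72 \operatorname{rank}(M_0))$ or not.

Case 1: Suppose first that

$$\|\hat M - M_0\|^2_{L_2(\Pi)} < \|\hat M - M_0\|_\infty^2 \, \frac{\log(d)}{0.0006 \log(6/5) \, p}.$$

Then $\|\hat M - M_0\|_\infty \le 3a$ implies that the statement of Theorem 2 is
true.

Case 2: It remains to consider the case

$$\|\hat M - M_0\|^2_{L_2(\Pi)} \ge \|\hat M - M_0\|_\infty^2 \, \frac{\log(d)}{0.0006 \log(6/5) \, p}.$$

Then (31) implies that
$\frac{1}{\|\hat M - M_0\|_\infty} (\hat M - M_0) \in \mathcal{C}(72 \operatorname{rank}(M_0))$
and we can apply Lemma 9. From Lemma 9, (29) and
$\|\hat M - M_0\|_\infty \le 3a$ we obtain that with probability at least
$1 - 8/d$ one has

$$\frac{1}{2} \|\hat M - M_0\|^2_{L_2(\Pi)} \le \frac{5\lambda}{3} \sqrt{2 \operatorname{rank}(M_0)} \, \|M_0 - \hat M\|_2 + 396 a^2 p^{-1} \left\{ 72 \operatorname{rank}(M_0) (\mathbb{E}\|\Sigma_R\|)^2 + 18 \right\}$$

$$\le 6 \lambda^2 p^{-1} \operatorname{rank}(M_0) + \frac{p}{4} \|\hat M - M_0\|_2^2 + 396 a^2 p^{-1} \left\{ 72 \operatorname{rank}(M_0) (\mathbb{E}\|\Sigma_R\|)^2 + 18 \right\}.$$

Now (3) implies that there exists a numerical constant $C$ such that

$$\|\hat M - M_0\|^2_{L_2(\Pi)} \le C p^{-1} \left\{ \operatorname{rank}(M_0) \left( \lambda^2 + a^2 (\mathbb{E}\|\Sigma_R\|)^2 \right) + a^2 \right\},$$

which leads to the statement of Theorem 2.

B Proof of Theorem 8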


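We adapt the proof of Theorem 5 in [17] to our setting. Assume w.l.o.g. that
$m_1 \ge m_2$. For a $\gamma \le 1$, define

$$\mathcal{L} = \left\{ \tilde L = (l_{ij}) \in \mathbb{R}^{m_1 \times r} : \ l_{ij} \in \left\{ 0, \gamma (\sigma \wedge a) \Big( \frac{r}{p m} \Big)^{1/2} \right\}, \ \forall \, 1 \le i \le m_1, \ 1 \le j \le r \right\}$$

and consider the associated set of block matrices

$$\mathcal{A} = \left\{ A = ( \tilde L \, | \cdots | \, \tilde L \, | \, O ) \in \mathbb{R}^{m_1 \times m_2} : \ \tilde L \in \mathcal{L} \right\},$$

where $O$ denotes the $m_1 \times (m_2 - r \lfloor m_2 / r \rfloor)$ zero matrix,
and $\lfloor x \rfloor$ is the integer part of $x$.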

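Remark 3. In the case $m_1 < m_2$, we only need to change the construction of
the low rank component of the test set. We first build a matrix
$\tilde L = ( \bar L \, | \, O ) \in \mathbb{R}^{r \times m_2}$ where
$\bar L \in \mathbb{R}^{r \times (m_2/2)}$ has entries in
$\left\{ 0, \gamma (\sigma \wedge a) (r/(pm))^{1/2} \right\}$ and, then, we
replicate this matrix to obtain a block matrix $A$ of size $m_1 \times m_2$:

$$A = \begin{pmatrix} \tilde L \\ \vdots \\ \tilde L \\ O \end{pmatrix}.$$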


By construction, any element of $\mathcal{A}$ as well as the difference of any
two elements of $\mathcal{A}$ has rank at most $r$. In addition, the condition
$p \ge r/m$ implies that the entries of any matrix in $\mathcal{A}$ take values
in $[0, a]$. Thus, $\mathcal{A} \subset \mathcal{A}(r, a)$.

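The Varshamov-Gilbert bound (cf. Lemma 2.9 in [24]) guarantees the existence
of a subset $\mathcal{A}^0 \subset \mathcal{A}$ with cardinality
$\mathrm{Card}(\mathcal{A}^0) \ge 2^{rM/8} + 1$ containing the zero
$m_1 \times m_2$ matrix $0$ and such that, for any two distinct elements $A_1$
and $A_2$ of $\mathcal{A}^0$,

$$\|A_1 - A_2\|_2^2 \ge \frac{M r}{8} \, \gamma^2 (\sigma \wedge a)^2 \frac{r}{p m} \Big\lfloor \frac{m}{r} \Big\rfloor \ge \frac{\gamma^2}{16} (\sigma \wedge a)^2 \, m_1 m_2 \frac{r}{p m}. \tag{33}$$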


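Using that the noise variables $\xi_{ij}$ are Gaussian, we get that, for any
$A \in \mathcal{A}^0$, the Kullback-Leibler divergence
$K(\mathbb{P}_0, \mathbb{P}_A)$ between $\mathbb{P}_0$ and $\mathbb{P}_A$
satisfies

$$K(\mathbb{P}_0, \mathbb{P}_A) = \frac{1}{2\sigma^2} \|A\|^2_{L_2(\Pi)} \le \frac{\gamma^2 M r}{2}. \tag{34}$$

From (34) we deduce that the condition

$$\frac{1}{\mathrm{Card}(\mathcal{A}^0) - 1} \sum_{A \in \mathcal{A}^0} K(\mathbb{P}_0, \mathbb{P}_A) \le \alpha \log \left( \mathrm{Card}(\mathcal{A}^0) - 1 \right) \tag{35}$$

is satisfied for any $\alpha > 0$ if $\gamma > 0$ is chosen as a sufficiently
small numerical constant depending on $\alpha$. In view of (33) and (35), an
application of Theorem 2.5 in [24] implies

$$\inf_{\hat M} \sup_{M_0 \in \mathcal{A}(r, a)} \mathbb{P}_{M_0} \left( \frac{\|\hat M - M_0\|_2^2}{m_1 m_2} > c \, (\sigma \wedge a)^2 \frac{r}{p m} \right) \ge \beta \tag{36}$$

for some absolute constant $\beta \in (0, 1)$, which implies the statement of
Theorem 8.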

C Proof of Lemma 9 

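This proof is close to the proof of Lemma 12 in [15]. Set

$$\mathcal{E} = 44 p^{-1} \left\{ r (\mathbb{E}\|\Sigma_R\|)^2 + 18 \right\}.$$

We will show that the probability of the following "bad" event is small:

$$B = \left\{ \exists \, A \in \mathcal{C}(r) \text{ such that } \left| \|A_\Omega\|_2^2 - \|A\|^2_{L_2(\Pi)} \right| > \frac{1}{2} \|A\|^2_{L_2(\Pi)} + \mathcal{E} \right\}.$$

Note that $B$ contains the complement of the event that we are interested in.

In order to estimate the probability of $B$ we use a standard peeling argument.
Let $\nu = \frac{\log(d)}{0.0006 \, p \log(6/5)}$ and $\alpha = \frac{6}{5}$.
For $l \in \mathbb{N}$ set

$$S_l = \left\{ A \in \mathcal{C}(r) : \ \alpha^{l-1} \nu \le \|A\|^2_{L_2(\Pi)} \le \alpha^l \nu \right\}.$$

If the event $B$ holds for some matrix $A \in \mathcal{C}(r)$, then $A$ belongs
to some $S_l$ and

$$\left| \|A_\Omega\|_2^2 - \|A\|^2_{L_2(\Pi)} \right| > \frac{1}{2} \|A\|^2_{L_2(\Pi)} + \mathcal{E} \ge \frac{1}{2} \alpha^{l-1} \nu + \mathcal{E} = \frac{5}{12} \alpha^l \nu + \mathcal{E}. \tag{37}$$

For $T > \nu$ consider the following set of matrices

$$\mathcal{C}(r, T) = \left\{ A \in \mathcal{C}(r) : \ \|A\|^2_{L_2(\Pi)} \le T \right\}$$

and the following event

$$B_l = \left\{ \exists \, A \in \mathcal{C}(r, \alpha^l \nu) : \ \left| \|A_\Omega\|_2^2 - \|A\|^2_{L_2(\Pi)} \right| > \frac{5}{12} \alpha^l \nu + \mathcal{E} \right\}.$$

Note that $A \in S_l$ implies that $A \in \mathcal{C}(r, \alpha^l \nu)$. Then
(37) implies that $B_l$ holds and we get $B \subset \cup B_l$. Thus, it is
enough to estimate the probability of the simpler event $B_l$ and then apply
the union bound. Such an estimation is given by the following lemma. Its proof
is given in Appendix D. Let

$$Z_T = \sup_{A \in \mathcal{C}(r, T)} \left| \|A_\Omega\|_2^2 - \|A\|^2_{L_2(\Pi)} \right|.$$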

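Lemma 10. We have that

$$\mathbb{P} \left( Z_T \ge \frac{5}{12} T + 44 p^{-1} \left\{ r (\mathbb{E}\|\Sigma_R\|)^2 + 18 \right\} \right) \le 4 \exp(-c_1 p T)$$

with $c_1 \ge 0.0006$.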

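Lemma 10 implies that $\mathbb{P}(B_l) \le 4 \exp(-c_1 p \, \alpha^l \nu)$.
Using the union bound we obtain

$$\mathbb{P}(B) \le \sum_{l=1}^{\infty} \mathbb{P}(B_l) \le 4 \sum_{l=1}^{\infty} \exp(-c_1 p \, \alpha^l \nu) \le 4 \sum_{l=1}^{\infty} \exp(-c_1 p \, \nu \log(\alpha) \, l),$$

where we used $e^x \ge x$. We finally compute for
$\nu = \frac{\log(d)}{0.0006 \, p \log(6/5)}$

$$\mathbb{P}(B) \le \frac{4 \exp(-c_1 p \, \nu \log(\alpha))}{1 - \exp(-c_1 p \, \nu \log(\alpha))} \le \frac{4 \exp(-\log(d))}{1 - \exp(-\log(d))}.$$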
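The right-hand side simplifies to the probability stated in Lemma 9. As a short worked step, using only $d = m_1 + m_2 \ge 2$ and the fact that $x \mapsto x/(1-x)$ is increasing on $(0,1)$:

```latex
\[
\mathbb{P}(B) \;\le\; \frac{4\exp(-\log d)}{1-\exp(-\log d)}
  \;=\; \frac{4/d}{1-1/d}
  \;=\; \frac{4}{d-1}
  \;\le\; \frac{8}{d}
  \qquad (d \ge 2),
\]
% the last inequality is equivalent to 4d <= 8d - 8, i.e. d >= 2.
```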


This completes the proof of Lemma 9. 


D Proof of Lemma 10 


We will start by showing that $Z_T$ concentrates around its expectation and
then we will upper bound the expectation. Recall that by definition,

$$Z_T = \sup_{A \in \mathcal{C}(r, T)} \left| \sum_{(i,j)} \eta_{ij} A_{ij}^2 - \mathbb{E} \Big( \sum_{(i,j)} \eta_{ij} A_{ij}^2 \Big) \right|.$$

We use the following Talagrand's concentration inequality:

Theorem 11. Suppose that $f : [-1, 1]^N \to \mathbb{R}$ is a convex Lipschitz
function with Lipschitz constant $L$. Let $\Xi_1, \dots, \Xi_N$ be independent
random variables taking values in $[-1, 1]$. Let
$Z := f(\Xi_1, \dots, \Xi_N)$. Then for any $t \ge 0$,

$$\mathbb{P} \left( |Z - \mathbb{E}(Z)| \ge 16 L + t \right) \le 4 e^{-t^2 / 2L^2}.$$

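For a proof see [22] and [8]. Let
$f(x_{11}, \dots, x_{m_1 m_2}) := \sup_{A \in \mathcal{C}(r, T)} \big| \sum_{(i,j)} (x_{ij} - \pi_{ij}) A_{ij}^2 \big|$.
It is easy to see that $f(x_{11}, \dots, x_{m_1 m_2})$ is a Lipschitz function
with Lipschitz constant $L = \sqrt{p^{-1} T}$. Indeed,

$$\left| f(x_{11}, \dots, x_{m_1 m_2}) - f(z_{11}, \dots, z_{m_1 m_2}) \right|$$

$$= \left| \sup_{A \in \mathcal{C}(r, T)} \Big| \sum_{(i,j)} (x_{ij} - \pi_{ij}) A_{ij}^2 \Big| - \sup_{A \in \mathcal{C}(r, T)} \Big| \sum_{(i,j)} (z_{ij} - \pi_{ij}) A_{ij}^2 \Big| \right|$$

$$\le \sup_{A \in \mathcal{C}(r, T)} \left| \Big| \sum_{(i,j)} (x_{ij} - \pi_{ij}) A_{ij}^2 \Big| - \Big| \sum_{(i,j)} (z_{ij} - \pi_{ij}) A_{ij}^2 \Big| \right|$$

$$\le \sup_{A \in \mathcal{C}(r, T)} \Big| \sum_{(i,j)} (x_{ij} - z_{ij}) A_{ij}^2 \Big|$$

$$\le \sup_{A \in \mathcal{C}(r, T)} \sqrt{\sum_{(i,j)} \pi_{ij}^{-1} (x_{ij} - z_{ij})^2} \ \sqrt{\sum_{(i,j)} \pi_{ij} A_{ij}^4}$$

$$\le \sup_{A \in \mathcal{C}(r, T)} \sqrt{p^{-1} \sum_{(i,j)} (x_{ij} - z_{ij})^2} \ \sqrt{\sum_{(i,j)} \pi_{ij} A_{ij}^2}$$

$$\le \sqrt{p^{-1} T} \ \sqrt{\sum_{(i,j)} (x_{ij} - z_{ij})^2},$$

where we used $\big| |a| - |b| \big| \le |a - b|$, $\|A\|_\infty \le 1$ and
$\|A\|^2_{L_2(\Pi)} \le T$. Now, Theorem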
II and 2y/jP^ < T + p~'^ imply 

e(zt> IE(Zt) + 768p"i + + t] < 


Taking t = | we get 

P (^Zt > E{Zt) + 768p-i + i (38) 

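with $c_1 \ge 0.0006$.

Next we bound the expectation $\mathbb{E}(Z_T)$. Using a standard
symmetrization argument (see e.g. [18]) we obtain

$$\mathbb{E}(Z_T) = \mathbb{E} \left( \sup_{A \in \mathcal{C}(r, T)} \Big| \sum_{(i,j)} \eta_{ij} A_{ij}^2 - \mathbb{E} \Big( \sum_{(i,j)} \eta_{ij} A_{ij}^2 \Big) \Big| \right) \le 2 \, \mathbb{E} \left( \sup_{A \in \mathcal{C}(r, T)} \Big| \sum_{(i,j)} \epsilon_{ij} \eta_{ij} A_{ij}^2 \Big| \right),$$

where $\{\epsilon_{ij}\}$ is an i.i.d. Rademacher sequence. Then, the
contraction inequality (see e.g. [16, Theorem 2.2]) yields

$$\mathbb{E}(Z_T) \le 8 \, \mathbb{E} \left( \sup_{A \in \mathcal{C}(r, T)} \Big| \sum_{(i,j)} \epsilon_{ij} \eta_{ij} A_{ij} \Big| \right) = 8 \, \mathbb{E} \left( \sup_{A \in \mathcal{C}(r, T)} \left| \langle \Sigma_R, A \rangle \right| \right),$$

where $\Sigma_R = \sum_{(i,j)} \epsilon_{ij} \eta_{ij} X_{ij}$. For
$A \in \mathcal{C}(r, T)$ we have that

$$\|A\|_* \le \sqrt{r} \, \|A\|_2 \le \sqrt{r p^{-1}} \, \|A\|_{L_2(\Pi)} \le \sqrt{r p^{-1} T},$$

where we have used (3). Then, by the duality between nuclear and operator
norms, we compute

$$\mathbb{E}(Z_T) \le 8 \, \mathbb{E} \left( \sup_{A \in \mathcal{C}(r, T)} \left| \langle \Sigma_R, A \rangle \right| \right) \le 8 \sqrt{r p^{-1} T} \ \mathbb{E}(\|\Sigma_R\|).$$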

Finally, using 


^ + 8V^;^^E (11E«1|) < Q + It + 44rp-i (E (US^U))^ 


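and the concentration bound (38) we obtain that

$$\mathbb{P} \left( Z_T \ge \frac{5}{12} T + 44 p^{-1} \left\{ r (\mathbb{E}\|\Sigma_R\|)^2 + 18 \right\} \right) \le 4 \exp(-c_1 p T)$$

with $c_1 \ge 0.0006$ as stated.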


E Proof of Lemma 1 

It is easy to see that

$$\|(M^{k+1} - M^k)_{\bar\Omega}\| \le \|(M^{k+1} - M^k)_{\bar\Omega}\|_2 \le \|M^{k+1} - M^k\|_2$$

and

$$\|M^{k+1} - M^k\|_\infty \le \|M^{k+1} - M^k\|_2.$$

Thus, it is enough to show (8). The proof of (8) is close to the proof of Lemma
4 in [19].

Let us denote by $\tilde M^k$ the solutions produced by Algorithm 1 after the
soft-thresholding step and before the truncation step (6). We have that

$$\|M^{k+1} - M^k\|_2 \le \|\tilde M^{k+1} - \tilde M^k\|_2 \le \|(M^k - M^{k-1})_{\bar\Omega}\|_2 \le \|M^k - M^{k-1}\|_2, \tag{39}$$

where in the second inequality we used the following result (see, for example,
Lemma 3 in [19]):

Proposition 12. The soft-thresholding operator $S_\lambda(\cdot)$ satisfies the
following: for any $W_1, W_2$,

$$\|S_\lambda(W_1) - S_\lambda(W_2)\|_2 \le \|W_1 - W_2\|_2.$$
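Proposition 12 states that soft-thresholding is non-expansive in Frobenius norm. A quick numerical sanity check on random matrices is sketched below; the dimensions, the threshold 0.8 and the function name are arbitrary illustrative choices.

```python
import numpy as np


def soft_threshold(W, lam):
    """S_lambda(W): shrink every singular value of W by lam, flooring at 0."""
    U, d, Vt = np.linalg.svd(W, full_matrices=False)
    return (U * np.maximum(d - lam, 0.0)) @ Vt


rng = np.random.default_rng(3)
W1 = rng.standard_normal((7, 5))
W2 = rng.standard_normal((7, 5))
# np.linalg.norm of a 2-D array defaults to the Frobenius norm
lhs = np.linalg.norm(soft_threshold(W1, 0.8) - soft_threshold(W2, 0.8))
rhs = np.linalg.norm(W1 - W2)
```

This is the property that makes the middle inequality in (39) work: two successive inputs to $S_\lambda$ differ only off the observed set.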

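The inequality (39) implies that the sequence
$\{\|M^k - M^{k-1}\|_2\}_{k \ge 1}$ converges. It remains to show that it
converges to zero. Note that the inequalities (39) imply that

$$\|M^{k+1} - M^k\|_2^2 - \|(M^{k+1} - M^k)_{\bar\Omega}\|_2^2 = \|(M^{k+1} - M^k)_\Omega\|_2^2 \to 0.$$

So, we only need to show that $\|(M^{k+1} - M^k)_{\bar\Omega}\|_2 \to 0$.

We put

$$Q(A, B) = \frac{1}{2} \|(Y - B)_\Omega\|_2^2 + \frac{1}{2} \|(A - B)_{\bar\Omega}\|_2^2 + \lambda \|B\|_*.$$

Note that (7) implies

$$Q(M^k, \tilde M^k) \ge Q(M^k, \tilde M^{k+1})$$

$$= \frac{1}{2} \|(Y - \tilde M^{k+1})_\Omega\|_2^2 + \frac{1}{2} \|(M^k - \tilde M^{k+1})_{\bar\Omega}\|_2^2 + \lambda \|\tilde M^{k+1}\|_*$$

$$\ge \frac{1}{2} \|(Y - \tilde M^{k+1})_\Omega\|_2^2 + \frac{1}{2} \|(M^{k+1} - \tilde M^{k+1})_{\bar\Omega}\|_2^2 + \lambda \|\tilde M^{k+1}\|_*$$

$$= Q(M^{k+1}, \tilde M^{k+1}), \tag{40}$$

where in the last inequality we used that

$$M^{k+1}_{ij} = \begin{cases} \tilde M^{k+1}_{ij} & \text{if } |\tilde M^{k+1}_{ij}| \le a \\ a & \text{if } \tilde M^{k+1}_{ij} > a \\ -a & \text{if } \tilde M^{k+1}_{ij} < -a. \end{cases} \tag{41}$$

The inequality (40) shows that the sequence $\{Q(M^k, \tilde M^k)\}_{k \ge 1}$
converges. This and (40) yield

$$Q(M^k, \tilde M^{k+1}) - Q(M^{k+1}, \tilde M^{k+1}) = \frac{1}{2} \left( \|(M^k - \tilde M^{k+1})_{\bar\Omega}\|_2^2 - \|(M^{k+1} - \tilde M^{k+1})_{\bar\Omega}\|_2^2 \right) \to 0. \tag{42}$$

Now, it is easy to see that

$$\|(M^k - \tilde M^{k+1})_{\bar\Omega}\|_2^2 - \|(M^{k+1} - \tilde M^{k+1})_{\bar\Omega}\|_2^2 \ge \|(M^k - M^{k+1})_{\bar\Omega}\|_2^2. \tag{43}$$

Indeed, for $(i,j)$ in $\bar\Omega$ such that $|\tilde M^{k+1}_{ij}| \le a$ we
have that

$$(M^k_{ij} - \tilde M^{k+1}_{ij})^2 - (M^{k+1}_{ij} - \tilde M^{k+1}_{ij})^2 = (M^k_{ij} - M^{k+1}_{ij})^2$$

and for $(i,j)$ in $\bar\Omega$ such that $|\tilde M^{k+1}_{ij}| > a$ we have
that

$$(M^k_{ij} - \tilde M^{k+1}_{ij})^2 - (M^{k+1}_{ij} - \tilde M^{k+1}_{ij})^2 \ge (M^k_{ij} - M^{k+1}_{ij})^2,$$

where we used (41). Now (42) together with (43) imply (8), which completes
the proof of Lemma 1.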


F Proof of Lemma 4 


In order to prove (10), we use the following remarkable bound on the spectral
norms of random matrices. It is obtained by extension to rectangular matrices
via self-adjoint dilation of Corollary 3.12 and Remark 3.13 in [2] (cf. Section
3.1 in [2]).

Proposition 13 ([2]). Let $A$ be the $m_1 \times m_2$ rectangular matrix whose
entries $A_{ij}$ are independent centered bounded random variables. Then, for
any $0 < \epsilon \le 1/2$ there exists a universal constant $c_\epsilon$ such
that, for every $t \ge 0$,

$$\mathbb{P} \left( \|A\| \ge (1 + \epsilon) 2 \sqrt{2} (\sigma_1 \vee \sigma_2) + t \right) \le (m_1 \wedge m_2) \exp \left( - \frac{t^2}{c_\epsilon \sigma_*^2} \right),$$

where we have defined

$$\sigma_1 = \max_i \sqrt{\sum_j \mathbb{E}[A_{ij}^2]}, \quad \sigma_2 = \max_j \sqrt{\sum_i \mathbb{E}[A_{ij}^2]}, \quad \sigma_* = \max_{i,j} |A_{ij}|.$$

We apply Proposition 13 to $\Sigma = \sum_{(i,j)} \eta_{ij} \xi_{ij} X_{ij}$.
We compute

$$\sigma_1 = \sigma \max_i \sqrt{\pi_{i \cdot}} \quad \text{and} \quad \sigma_2 = \sigma \max_j \sqrt{\pi_{\cdot j}}.$$

Bound (4) implies that $\sigma_1 \vee \sigma_2 \le \sigma \sqrt{L}$. On the
other hand, Assumption 1 implies $\max_{i,j} |\eta_{ij} \xi_{ij}| \le b$. Now,
taking $\epsilon = 1/2$ in Proposition 13 we get (10).

In order to prove (11) we use the following result.

Proposition 14 (Corollary 3.3 in [2]). Let $A$ be the $m_1 \times m_2$
rectangular matrix with $A_{ij}$ independent centered bounded random variables.
Then, there exists a universal constant $C^*$ such that

$$\mathbb{E} \|A\| \le C^* \left\{ \sigma_1 \vee \sigma_2 + \sigma_* \sqrt{\log(m_1 \wedge m_2)} \right\},$$

where $\sigma_1, \sigma_2, \sigma_*$ are defined in Proposition 13.

We apply Proposition 14 to $\Sigma_R = \sum_{(i,j)} \eta_{ij} \epsilon_{ij} X_{ij}$,
where $\{\epsilon_{ij}\}$ is an i.i.d. Rademacher sequence. We have that
$\sigma_1 \vee \sigma_2 \le \sqrt{L}$ and $\sigma_* \le 1$; then Proposition 14
implies (11).